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Abstract 

The spectral energy distribution (SED) is a relatively easy way for astronomers to distinguish be¬ 
tween different astronomical objects such as galaxies, black holes, and stellar objects. By comparing 
the observations from a source at different frequencies with template models, astronomers are able to 
infer the type of this observed object. In this paper, we take a Bayesian model averaging perspective to 
learn astronomical objects, employing a Bayesian nonparametric approach to accommodate the devia¬ 
tion from convex combinations of known log-SEDs. To effectively use telescope time for observations, 
we then study Bayesian nonparametric sequential experimental design without conjugacy, in which we 
use sequential Monte Carlo as an efficient tool to maximize the volume of information stored in the 
posterior distribution of the parameters of interest. A new technique for performing inferences in log- 
Gaussian Cox processes called the Poisson log-normal approximation is also proposed. Simulations 
show the speed, accuracy, and usefulness of our method. While the strategy we propose in this paper 
is brand new in the astronomy literature, the inferential techniques developed apply to more general 
nonparametric sequential experimental design problems. 

Keywords: Bayesian nonparametrie, sequential experimental design, sequential Monte Carlo, speetral en¬ 
ergy distribution, Bayesian model averaging, log-Gaussian Cox proeesses, Poisson log-normal approxima¬ 
tion 

1 Introduction 

Speetral energy distributions (SEDs), as well as their fitting, are used in many branehes of astronomy to 
eharaeterize astronomieal sourees. For example, photometrie redshift estimation (distanee estimation of 
sourees) relies heavily on the SED morphology of galaxies. Beeause of its strong predietive power and 
relative ease of use, SED fitting is very eommonly used in astronomy. 

Our scientific goal in this paper is to fit teleseope observations to various existing template SEDs, whieh 
are generated either from observations of known astronomieal objeets or models, in order to (i) elassify a 
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new astronomical source, (ii) analyze a new blended source as a geometrically weighted average of template 
models, or (iii) detect the evidence of a new type of SED which cannot be directly described by known 
templates. 

However, due to the high cost of using sophisticated telescopes and the limitation of observation time, it 
is necessary to carefully design the observational strategy using all the information we have, including the 
template models, the specifications of telescope filter^ and those existing observations. In the context of 
SED fitting, this is equivalent to specify the set of filters to use in order to better achieve the aforementioned 
scientific goal. Decisions regarding fhese specificafions musf be made before fhe dafa collecfion. Because 
specific informafion is usually available prior fo fhe use of fhe felescope, fhe Bayesian framework will play 
an imporfanf role. 

Especially, sequenfial design is preferred as opposed fo a non-sequenfial one. The following fhree 
advanfages motivate our choice of fhis mefhodology: Eirsfly, fhe optimal sequenfial design procedure musf 
be af teas! as good as a fixed design procedure ( [Chaloner and Verdinelli 1 1995[ ). Secondly, if is usually more 
compufalionally efficienl, as finding fhe opfimal non-sequenfial design for all design variables af a time is 
usually NP-hard ( |Ko ef al. 119951). Einally, a sequenfial design can also incorporafe fhe existing liferafures 
on mulli-armed bandif problems ( Robbins| | 1952|, Berry and Erisfedf 119851, Krause and Qng| |2011 1) and 
sequenfial Monte Carlo (SMC) ( (Cherkassky and Bornn |20131). 

Our statistical goal in fhis paper is fo provide an efficienl and fasl calculafion scheme fo reach fhe 
opfimal sequential design under fhe SED filling conlexl. One compulafional difficully comes from fhe 
incorporafion of a non-conjugate Bayesian nonparamelric prior on fhe devialion belween fhe (convex com- 
binafions of) femplafe models and fhe Irulh. If is fhis non-conjugale selling lhal makes our work differenl 
from fhe exisfing liferafures in Bayesian nonparametric sequential experimental design (BNS-ED), which 
usually assumes a Gaussian conjugacy. 

There are several major conlribufions we make in fhis paper. We are fhe firsf fo our knowledge fo 
sludy BNS-ED wilhoul conjugacy. Our second conlribufion is fo provide a fasl SMC algorilhm fo solve 
fhe general BNS-ED problems. Eurfhermore, by employing fhe special model slruclure of log-Gaussian 
Cox process (EGCP), fhe main model of inleresl in fhis paper, we inlroduce a new technique called Poisson 
log-normal approximation (PoENA) fo improve compulation speed. As a Ihird conlribufion, we apply fhe 
new mefhodology slaled above fo sequentially choose fhe besl fillers fo use and fil fhe SED on-lhe-fly. 
To show how our mefhod could be applied, we perform simulafion lesls on real asfronomical lemplafes 
and demonslrale lhal, using our algorilhm, one can beller analyze fhe unknown SED in terms of femplafe 


'Filters here mean the physical filters used with the detectors on the telescope in order to restrict the observed electromagnetic 
bandwidth, a range of frequencies. 
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(b) A three-level hierarchical (mixture) log-Gaussian 


|^(a~log-GP '^a>jjUj,k (Level 2) 


(Level 3) 


(a) Relationships between different quantities involved. Cox process. The dashed array here means a determin- 
Blue and orange lines mean some other SED templates, istic relationship. 

Figure 1: Graphical illustrations for the problem of SED fitting as well as its model setting. 

models with fewer observations, which shows the practical value of our methodology. 

The rest of this paper is organized as follows. In Section we formulate the main scientific problem 
of interest and specify the quantities and notations we will use throughout this paper. In Section we 
introduce the statistical model, design perspective, and general SMC inferential scheme for BNS-ED. We 
then introduce the specialized technique PoENA for EGCP in Section]^ Simulation examples are provided 
in Section 1^ Einally, we discuss several related works and conclude in Section]^ We leave more detailed 
calculations to the Supplementary Material. 

2 Motivation 

A graphical illustration for the problem of fitting a spectral energy distribution (SED) is shown in Eigure 

[Til 

The true but unknown SED from an astronomical source is denoted as A {u), where u represents the 
frequency of the photons and A {v) means the intensity of those arriving photons with frequency u. Several 
known templates are denoted as exp (i/)), where /ij’s are the template log-SEDs. The actual observation 
yt we collect from the telescope is the total number of photons observed using a particular filter with a 
certain bandwidth Bt = {ut — dt^vt + St), which centers around vt with a frequency range St- Eet Nt{-) 
denote the complete empirical distribution of photon arrivals emitted at time t with different frequencies. 
Then yt = Nt {Bt), because one can only observe the total counts of photons in the bandwidth Bt. 

Motivating astronomical problem. Use all of the collected data yt’s as well as the filters Bt’s to 
describe the unknown SED A {u) in terms of the existing SED templates, exp (/ij (z^))’s. 

Most existing astronomy literatures use frequentist model selection methodologies to determine which 
template could best represent the truth, that is, to find a unique i such that A {u) = exp (/ij (z^)). However 
these methods (i) fail to quantify the uncertainty of this sole template selection away from the true SED, 
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(ii) cannot take advantage of such uncertainty information to suggest the next step of observational setting, 
and (iii) ignore the possibility that the true SED might not be accounted for by those selected templates, as 
shown by the region of larger v in Figure [Ta| 

To address (ii), a naive and prevailing approach for collecting observations is to use each single available 
filter on the telescope and fit the SED until the completion of data collection. However, this might be 
expensive because of the time constraint on telescope availability (on average only 8 night hours per day). 
Multiple visits to the telescope due to an inefficient observational strategy will increase the monetary cost 
for astronomers. 

A unifying and adequate solution addressing all of the issues above is obtained through a Bayesian 
approach. Instead of choosing a sole template, we take a Bayesian model averaging perspective on those 
templates by assuming that A {u) = exp (^))> where u = (wi,..., uJraf" is the vector of mixture 

weights assigned to each template such that YllLi summarize the existing observations as a 

prior distribution on u and accordingly induce the posterior distribution on u by incorporating our current 
observations. For (i), the uncertainty information for the goodness-of-fit of the data could be extracted from 
the posterior distribution on u, e.g., the posterior probability intervals for each iVi. For (ii), the posterior 
distribution on a; also serves as the primary proxy for choosing the next filter to use. This is how Bayesian 
sequential experimental design comes into play. 

Finally, to address (iii), we can write A (u) = exp (^) + ^ (^))> where e (u) describes the 

deviation between the true log-SED and (convex combinations of) those selected templates. Due to the 
limited knowledge of the unstructured term e (z^), we naturally impose a weak prior on it, which requires the 
use of a Bayesian nonparametric prior and hence motivates our study of Bayesian nonparametric sequential 
experimental design (BNS-ED). 

3 Design and Inferential Scheme for BNS-ED 

In this Section, we will specify the statistical models as well as the utility function for BNS-ED. We then 
discuss the inference for BNS-ED in general without employing any specific feafures of our model. The 
derivations of equations Q, Q, and Q can be found in Supplementary Materials [A| 

3.1 Model Specification 

Due to the discrete nature of our observations, it is convenient to model the photon arrivals at different 
frequencies, Nt, as an inhomogeneous Poisson point process (or Poisson random measure): 

Nt{-) - |aPP (A) i.i.d. for all f. 
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yt = Nt (Bt) Poisson A (u) diy^ . 


Also, without making any strong assumptions, it is convenient to assign the prior distribution for e (v) —in a 
nonparametric way—as a Gaussian process (GPj^ e (z^) GP (0, k), where k = k{v, u') is the covariance 
function (or kernel function). In this paper, we assume that k is given, fixed, and coming from a parametric 
family. We can also conduct a full Bayesian inference on the parameters of k by using the inferential 
techniques we introduce in this paper, but for the ease of demonstration we choose not to address this point 
further. 

Later on in the simulation we will conventionally make a stronger assumption that k [v, v) = cj^ for 
any v, which might not be true in practice. To have a more accurate representation of the reality about k, we 
can perform a maximum likelihood fitting for k from a flexible covariance sfrucfure based on existing ob¬ 
servations. One anonymous reader also mentioned that cr^ should be small for observing a well-understood 
object, as the deviation term is essentially unnecessary. Hence, even though our Bayesian framework will 
allow the flexibility that the truth can deviate from the selected templates, it will also be compatible with 
the frequentist methodologies used by astronomers, which could provide strong predictive power for the 
astronomical objects. 

We finally assign a prior on cj as a; ^ Dir (q) by incorporafing some asfronomy prior knowledge, 
which we choose nol to discuss here for the clarity of presentation. Another way of choosing a is from an 
empirical moment matching on existing observations. In practice, m might be very large, so we could also 
choose OL to encourage sparse u>. 

Thus, in terms of a Bayesian graphical (or hierarchical) model representation, shown in Figure the 
main parameter of interest—the unknown SED A—has a nonparametric prior distribution that is a log- 
Gaussian process with a mean function depending on uj and a known covariance function. Levels 1-3 


together specify a mixture (over cj) log-Gaussian Cox process (LGCP) on Nt (•) (see Ghahramani et al. 


1 2006) , Lawrence and Moore | 2007| , Rue and Martino |2009| for its generalization), which we will use 
throughout this paper as the main model of interest. 


3.2 Design Objective 


Lollowing the seminal work by [Lindley |1956|, we consider the expected gain in Shannon information 
([Shannon 


119481) as the utility function. Precisely, our design goal in BNS-ED is to sequentially choose a 


design that maximizes the expected Kullback-Leibler divergence between the posterior distributions of u 


^The choice of GP here is purely conventional. It is possible to replace this nonparametric prior by any stable process (e.g. a 
Cauchy process) if alternative tail behavior is desired. Our SMC approach will easily scale to such alternative specifications. 
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at time t and t — 1: 


EIGt {DKL{pHyt,Bt,Et-i)\\pHTt-i))), 


( 1 ) 


where Et-i = a {yi, Bi, yt_i, Bt-i) denotes all the historical information before the f-th observation. 
To summarize, we want to have a series of design decisions that extract the most information for u after 
each observation. 

Note that here we do not study the information gain directly on A since our astronomical goal in this 
paper is more to analyze the unknown truth with the existing templates. We wish that the inclusion of the 
deviation term in some cases (like Example 2 in Section can provide some signals of the existence of 
deviation, but, when there is no deviation, we still wish to get more information about u). 

By using 


/ I T' A ^ vyc t—i? / IT' 

p{u;Et) = — — —p{uEt-i 

P[yt\Bt,Et-i) 

the expected information gain in equation ([T]l can be further rewritten as 


( 2 ) 


EIGt{Bt) {DKL{p{yt\Bt,Et-i,uj)\\p{yt\Bt,Et-i))). (3) 

The problem now reduces to approximating the expectation in equation Q, which will be approached with 
sequential Monte Carlo techniques. 


3.3 Sequential Monte Carlo (SMC) Inference 


We employ the SMC procedure to approximate each of the posterior distributions p by a set of 


particles , V'l-i | ■ Thus, to approximate the expected information gain in equation (31, we can use 

p I 


N 


N 


EIGt{Bt)^Y.^tiY.^og 

yt=o 




i=l 


E^=i ^t-iP ( yt\Bt, 




(*') 


p{yt\Bt,Ft-i,u:[]l^ 


(4) 


where the summation over yt can be further narrowed as we describe in Subsection |4.3| 

Based on this approximation, we choose Bt to maximize the approximated expected information gain 
from the set of available filters (Bt = argmax EIGt {B)). Using this filter, we acquire a new observation 

BSFilters 

yt. Then, according to equation Q, we update the particles at time t via 


u 


(*) 


= oc V'l-i xpiyt 


di) 


id 


Bt,Ft-uufl, 


for i = 1, Every time the particles are updated as above, we monitor the effective sample size of 

the particles. Once the effective sample size drops below a given threshold, we resample from the current 
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particles. Furthermore, in order to increase the diversity of the particles, we perturb each resampled particle 
by a Markovian move. To achieve this, we first note that 

t 

p{u;\J^t) OC Y\p {ys\Bs, P {^) ■ 

S = 1 

Then, to sample from p{uj\Tt), we just propose a new for each i = 1, ...,N by drawing from 

Dir where r is a tuning parameter representing the step size, and then accept the proposal with the 

usual Metropolis-Hastings acceptance probability. Algorithm[T]in the Supplementary Materials [Cjdescribes 
the complete methodology. 


4 Efficient Computations for LGCP 


The SMC procedure described above relies heavily on a fast and accurate way to calculate the following 
posterior predictive distribution for yt using the filter Bt: 

jp{yt\Bt, p) 0^=1 p{ys\Bs,p)p{p\ij^)dp 


p{yt\Bt,Bt-i,u) = 


Ul=\p{ys\Bs,J^s-i,i^) 


(5) 


where p = log (A). This might be achieved by using a vanilla Monte Carlo estimate (see [Meeds and Welling 
1 2014) for a recent development of this kind of simulation-based technique). However, due to the functional 
nature of a Gaussian process, the sample space in calculating equation Q is too large for vanilla Monte 
Carlo integration to be computationally efficient. Furthermore, this large scale Monte Carlo estimate needs 
to be repeated N times for each observation time t, which clearly slows down the performance of Algorithm 
[T] An alternative calculation of the posterior predictive distribution will be introduced in the following two 
Subsections. We call this new approach the Poisson log-normal approximation (PoLNA) (see e.g. ams| 
et al. 120091 and Simpson et al. |2013[ for some common ways to infer LGCP). In the third Subsection, we 


apply PoLNA to reduce the amount of computations needed to calculate the Kullback-Leibler divergence. 

4.1 Poisson Log-Normal Approximation (PoLNA) 

The key idea of this approximation is to reduce the dimensionality of the integral in equation Q by finding 
fhe joinf disfribufion of 

A = (Al, ^ (A (Hi, 7?),..., A {Bt, p)f 4 ^ ^ , 

so we will have 

» t-i „ t-i 

/ p{yt\Bt,p)'[\p{ys\Bs,p)p{p\uj)dp = / Poisson (t/i I At) n Poisson {ys | ) p (A|a;) dA. 


S=1 


S=1 
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Simulation studies (where a simple example for i = 1 is shown in Figure]^ of Supplementary Materials 
[B] ) indicate that the joint distribution of log A can be approximated by a multivariate normal distribution 
M (/^Aj ^a) with a high accurac)0, where and X)a can be obtained either from a Monte Carlo estimate 
or from a deterministic numerical calculation using the extended technique of |Safak 119931. The detailed 
calculation for the later proposal can be found in the Supplementary Materials [B] 

Hence, by plugging into this approximation, the posterior predictive distribution becomes the condi¬ 
tional distribution of a multivariate Poisson log-normal distribution, PLN (/^, X)), which has a joint proba¬ 
bility mass function 


PLN(y|/r, S)= / Poisson (y* jA* jA/j (log A|/x, E) d (log A). 


( 6 ) 


S=1 


The (multivariate) Poisson log-normal distribution has several tractable properties such as analytical formu¬ 
las for its mean vector or covariance matrix, unimodal feature, and subexponentially decaying tail. It has 


been studied in depth by Aitchison and Ho 119891 and Perline 1 1998||. 


4.2 Laplace Transform Approximation of Multivariate Log-Normal Distribution 


Equation Q is a low dimensional integral (usually there are only 10 different filters available on the tele¬ 
scope), so one common way to achieve this type of numerical integration is from the multivariate Gaussian 
Hermite quadrature, which does not take full advantage of the special form of the integrand—a product of 
several Poisson likelihoods. Thus, in this Subsection, we will introduce an alternative way to approximate 
equation <01 using an approximation to the Laplace transform of a multivariate log-normal distribution. 

First, we rewrite <0 as 


PLN y|A,5] = 


gS^SS/2+/i'^S 

TT^ I 


■logA/j, ( A 


A + 5]S,5] dA 


(V) 


where A^’s are the unique components of A with corresponding multivariate log-normal parameters, Ji and 
S, and 


Vsn ^ i^^Poisson i.i.d. for all n = 1,..., n^, s = 1,..., tt; 

/ ni \ ^ 

y = S = ] TO. = {ui, ...,nuf'■ 


\n=l 


n=l 


Thus, the multivariate Poisson log-normal distribution is fully characterized by the Laplace transform of a 
multivariate log-normal distribution. 


^For notation clarity, in this Subsection we allow Sa to be degenerate, i.e., some of the As’s might have correlation 1. 






















Employing the same technique presented by Asmussen et al. |20131 (we omit the detailed proof here), 
we can derive a sharp approximation to the Laplace transform of a multivariate log-normal distribution as 


exp 


-n^ A 


logAAM ( A 


/i + SS,^ dA 


(M)^S-iW„ (M) 


T^-l 




w„ (M) 


Y^det (itt + Mdiag 


( 8 ) 


where M = S diag (n) diag and (M) is the multivariate Lambert W function defined as 

the unique solution of Mexp (—W„ (M)) = (M). This approximation is derived via the Laplace 

approximation in an asymptotic sense but it stays sharp over the entire domain of convergence of the Laplace 
transform. Combining equations Q and ([^ gives us an accurate and fast approximation to the multivariate 
Poisson log-normal distribution. 


4.3 Efficient Calculation of Kullback-Leibler Divergence 


In this Subsection, we will propose an efficient calculation scheme for equation Q by limiting the Kullback- 

Leibler divergence calculations over an effective range of yt. 

('2') ri 

Lor a given particle and a filter Bt, the inner summation in equation (41 can be well approximated 

by summing only over those yt in the (1 — a)-interval of p (jjt Bt, —thanks to the uni modal 

property and the subexponentially decaying tail of a Poisson log-normal distribution. We usually set a = 
5%. The remaining question now is how to find fhese fwo quanfiles for fhe conditional Poisson log-normal 
disfribufion. We will derive a pair of conservafive bounds by focusing on fhe univariate Poisson log-normal 
disfribufion PLN {yt\lJ-tt^ ^tt) since fhe unconditional disfribufion will be falter lhan fhe conditional one. 

As Perline | |1998} sfafes, fhe Poisson log-normal disfribufion has an upper fail asympfofically equal fo 
fhe upper fail of fhe log-normal disfribufion, so 


yt=M+l 

Hence, we lef fhe upper bound and lower bound fo be 


1 _ $ 


logiWj^ 

V^tt 


(9) 


Mu = 
Ml = 


exp I Zi_o/2 


-0/2 + 




+ 1) 


exp 




( 10 ) 

( 11 ) 


where \_x\ means fhe infeger pari of a real number x and Za is the a-quantile of a standard normal distribu¬ 
tion. 

When a is small, we expect the true upper quantile for PLN {yt\pt^ ^tt) to be large, so the tail approx¬ 
imation in equation Q is particularly accurate, and likewise for Mu in equation ([TO]). On the other hand. 
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when is moderate, the true lower quantile is not far away from 0, which is usually the answer given by 
equation ( [TT] ). When is quite large, even the true lower quantile can be extreme; in this case the tail 
approximation in equation Q will become accurate again, so equation ( [TT] ) is still valid. 

Simulation shows that both Mjj and Ml offer practically useful guidance for finding the upper and 
lower quantiles for PLN (r/tl/x^, S^i), which then can help us to reduce the total amount of calculations in 
equation 0. 

5 Simulation 

In this Section we discuss two simulation examples, one using two trigonometric templates to assess our 
algorithm and the other using three real templates from astronomy. Here we only focus on comparing differ¬ 
ent strategies for sequential design and demonstrating the faster speed of our methodology. The comparison 
between our Bayesian methodology and the existing frequentist inference methods is important but not our 
main focus here. Besides, those existing methods employed in astronomy community does not allow the 
on-line sequential learning, so neither do they have equal status to compare with our method. 

In the following examples, we only study three different types of strategic^ the proposed sequential 
Monte Carlo strategy (SMCS) using Algorithm [TJ the totally random strategy (TRS), by which the choice 
of the filter is totally by chance; and the greedy strategy (GS), which we deterministically choose the filters 
in the same order as the absolute differences between the integrated intensities |A (H, /r^) — A {B, ^ 2 ) \ for 
different filter B and two templates /ri,/r 2 - Clearly, GS only works when we have two template models. 
We note that TRS and GS are indeed the current methodologies employed by astronomers. 

This simulation section is primarily an illustration, for the real data includes more domain-specific 
fechnicalifies and will affecf fhe clarify of our presenfafion. These empirical resulfs will be included in a 
follow-up paper. Nofe fhaf fhe oufcomes and fakeaways are similar for fhe real dafa as for fhe simulation— 
SMCS will clearly perform heller lhan fhe TRS, which is commonly used in aslronomy. Hence, fhe primary 
change in fhe real selling will be fo include a larger lemplale base (over 100 femplafes), bul our melhodology 
can easily adapl fo Ibis much more complicale selling. 

Example 1: Exponential Trigonometric Templates We use two templates (m = 2) and let 

^tme (^) — (^true (^)) — ^l,turef^l (^) T ^2,trueP'2 (^) ) ^true — (O-S, 0.2) , 

1^1 (^) = 2 sin {Ittv) + 4, fi 2 (^) = 2 cos (27rz^) -|- 4, u € [0,1] , 

"'One anonymous reader once suggested to add the comparison with GP-UCB strategy proposed hy 
our simulation, but, as discussed in Section [60] their setting is different from our work here and hence GP-UCB is not directly 
applicable. 


Srinivas et al. | |2010| in 
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(a) 95% posterior probability interval of wi. (b) Posterior density of wi at f = 10. 

Figure 2: A time-varying demonstration for the posterior distribution of wi for Example 1. 


^ , CT = 0.2, I = 0.02, 

and finally let the range of yt for calculating equation Q come from ( [T0| and ( [TT] ). The ten filters available 
here have frequency range [0, 0.1], [0.1, 0.2],..., [0.9,1]. For each strategy (TRS, GS, and SMCS), we run 
the simulation up to f = 10. 

The 95% posterior probability intervals of tui for each time t can be found in Figure]^ The probability 
interval of wi at f = 0 comes from the uniform prior on [0,1]. Both SMC and GS converge faster and give 
narrower intervals (since f = 1) than TRS. Figure [^ also shows that both SMCS and GS give a narrower 
and less biased result than TRS. In this case, SMCS performs slightly better than GS in terms of the root 
of posterior mean square error (6.6% for TRS, 6.0% for GS, and 5.5% for SMCS). Recall that GS is valid 
only when m = 2, but the proposed SMCS can be applied to other cases. Actually, the generalization of 
GS for the case of m > 2 is our very first motivation to study this work. 


/c(z^,u') = 


cj^ exp 


Example 2: Active Galactic Nuclei (AGN), Composite (COMP), and Starburst (SB) Templates In 

this example, we set m = 3 and use three real templates from astronomy: AGN NGC5506, COMP IRAS 


19254-7245, and SB NGC 7714 (Richards et al. 120061, Elvis et al. |1994|, Hopkins et al. |2007|). The 


frequency is scaled to [0,1] with the set of filters being the same as in Example 1. Now let 


%rue (^) — ^l.turefiAGN (^) + <^2,tme/rcOMP {^) + Wayruef^SB {^) ■ 

The shape of /tagN’ Fgomp Fsb be found in Figure [4^ We consider two cases to demonstrate the 
importance of including a Gaussian process (GP) component in the modelling of the unknown rj (i^). 


11 































-TRS without GP TRS with GP -TRS without GP TRS with GP 

True vaiue - SMCS without GP SMCS with GP True vaiue - SMCS without GP SMCS with GP 




(a) 95% posterior probability interval of wi. (b) 95% posterior probability interval of UJ 3 . 

Figure 3: A time-varying demonstration for the posterior distribution of cj in the Case 1 of Example 2. 

Case 1: Correctly specified templates. Let cjtrue = (0.6,0.2,0.2)^ and use all the three templates 
(Fagn> FcomP’ /Tsb) foi" estimation, in this case is also plotted in FigureWe compare SMCS 
and TRS under two scenarios: modelling rj with a GP prior and without a GP prior. For the first scenario, 
we choose k the same as in Example 1. The 95% posterior probability intervals of uji and cus are shown in 
Figure]^ SMCS without a GP prior results in the narrowest interval because r?true correctly specified by 
the templates. On the other hand, no matter whether a GP prior is used or not, SMCS converges faster than 
the TRS. 

Here we emphasize that the slow convergence in Figure does not come from the approximation error 
of PoLNA (as the two cases without GP does not require any approximation) but instead from the small dif¬ 
ferences of the integrated intensities (for each filter) among the templates. It seems that the three templates 
differ significantly in Figure]^ but their actual integrated intensities do not. 

Case 2: Misspecified templates. Now we let untrue = (0,0,1)^, so = /rgg. In this case, we 
only use /U^qn /tcomp foi" estimation, so any simple convex combinations of /tagn /Tcomp 
still far away from the truth. We compare the SMCS with and without a GP prior using the same k as in 
Example 1, and in Figure|^we plot the 95% posterior region of 7 / at f = 10 (marginally for each u). SMCS 
with a GP prior outperforms SMCS without a GP prior, because the posterior region of the former is more 
capable of covering 7?true- Precisely, with a GP prior on rj there is a 27% posterior probability that 

11^ r/truelloo — IIFCOMP r/tj-uelloo ’ 

where H/Ho^ = sup {|/ (i^)] : z/ E [0,1]} and ^comp fhe most achievable estimation of i] without using 
a GP prior. A GP prior on rj allows more adaptation to the data, which leads to a more reliable conclusion— 
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(a) The three real templates from the astronomy. Also (b) 95% posterior region of rj when f = 10 in the Case 
shown here is study in the Case 1. 2. 

Figure 4: Illustrations of Example 2 in terms of log-SED’s. 


even though the templates we use for estimation cannot completely describe the truth. This proves the 
potential and the practical value of our methodology. 


6 Discussion 

6.1 Differences with related works 


Bayesian experimental design versus Bayesian optimization. The use of Gaussian process (GP) to 
sequentially choose the next experiment to perform so as to optimize some value of information can be 


traced back to Kushner 119631. The research alone this line is generally called Bayesian optimization. More 
literature reviews in this direction can be found in the survey papers [Brochu et ah |2010| and Erazier 12010|. 
In this context, |Villemonteix et ah] | |2009C also uses an information theoretic criterion similar to our work. 

However, our formulation of Bayesian sequential experimental design is fundamentally different from 
Bayesian optimization. Optimization aims to find the optimum of a given target (a local criterion); experi¬ 
mental design seeks to characterize the entire target distribution (a global criterion), as measured by some 
(set of) aggregate information metric(s). Hence, the focus of our approach is not the same as those Bayesian 
optimization literatures. 


Sequential design versus non-sequential design. There are indeed some literatures study experi¬ 


mental design with global criterion using the Bayesian nonparametric GP prior. Eor example. Sacks et al. 


1 1989| seeks to minimize integrated mean square error, while Shewry and Wynn |1987| and [Currin et al. 


| |1991) want to maximize the entropy of the posterior. However, these papers do not study the sequential 
procedure, so they are different from our paper. 
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As far as we know, only few existing papers aim to globally understand a target, studying sequential 
design and Bayesian nonparametric estimation at the same time, which is what we have done in this paper. 
The most matchable literature we know so far regarded to the intersection between Bayesian nonparametrics 


and Bayesian sequential experimental design is Ferreira and Sanyal 120141, which still differs a lot from our 
work not only in the scientific goal, in the design problem formulation, but also in the inferential techniques. 


Non-conjugate model versus Gaussian conjugate model. Even though Ferreira and Sanyal 1 2014 1 


is the most similar literature to our work so far, they consider only the conjugate GP model with continuous 
Gaussian measurement error for the observations. Actually, most of the existing GP literatures (whether 
focusing on Bayesian optimization, active learning, or other problems) assume the Gaussian conjugacy in 
their model setting, in which case the inference techniques are still based on closed analytical forms. 

However, our approach is capable of dealing with non-conjugate Bayesian nonparametric models. Our 
paper not only adopts a non-conjugate normal-Poisson model but also provides an efficient inference tech¬ 


nique (SMC) that can generalize to other nonparametric priors. Even though Gramacy and Polsona |20111 
also study SMC inference for the sequential design on a GP, they neither focus on a global criterion to 
understand the target nor use a non-conjugate model. 


Design versus inference in the non-conjugate hierarchical Bayesian nonparametric model. Our 

paper studies sequential active learning in addition to performing inference in a non-conjugate hierarchical 
Bayesian nonparametric model. While many papers have studied the inference portion of this task (usually 
based on EGCP, such as Rue and Martino 1 2009[ and Simpson et al. |2013|), none have simultaneously 
tackled the design problem, largely due to the huge computational cost. In this paper, we have created an 
efficient computational approach to solve this problem, demonstrating its usefulness on an astronomical 
application. 


6.2 Conclusion 


In this paper, we first study the problem of Bayesian nonparametric sequential experimental design (BNS- 
ED) without conjugacy. We build a three-level hierarchal Bayesian nonparametric model that aims to find 
the optimal astronomical observational strategy. A sequential Monte Carlo (SMC) strategy is then proposed 
to solve the problem of interest. To overcome the computation hurdle inherited naturally from the log- 
Gaussian Cox process (EGCP), we exploit the special features of this model and provide a new inference 
technique for it, called Poisson log-normal approximation (PoENA), which can still be applied even for 
spatial-temporal EGCP. 
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We would like to emphasize that even though we mainly focus on the widely studied LGCP in this 
paper, the BNS-ED framework we discuss here and the corresponding inferential scheme, which employs 
sequential Monte Carlo techniques, can be easily generalized to other nonparametric models. 

The computation problem encountered in this paper is generally difficult, so we suspect that this is why 
there is no existing work on this topic—sequential design for globally learning a nonparametric target with 
non-conjugate sampling distribution. Our computational technique provides the algorithmic speed to make 
this idea practical on real applications, which then justifies fhe novelfy and value of our approach. 
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A Derivations 

A.l Equation @ 


By Bayes Theorem, 


p {yt, Bt\u, P 
P {yt, Bt\Bt-i) 


p{yt\Bt,u;,Tt-i)p{Bt\u,Tt-i)p{uj\Tt-i) 

p{yt,Bt\Tt-i) 
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p{yt,Bt\Bt-i) p{yt\Bt,Bt-i) ^ ^ ’ 

where the third line follows from p {Bt\u;, Tt-i) = p {Bt\Bt-i) because by definition Bt and u are inde¬ 
pendent given Bt-i- 


A.2 Equation @ 

Let us simplify the expected information gain EIGt {Bt) in equation Q. We note that 


p{u\yt,Bt,Et-i) 
p{uj\Ft-i) 

p{yt\Bt,u,Ft-i) \ p{yt\Bt,u,Ft-i) 


p{u\yt,Bt,Ft-i)<iu> 

p{u\Ft-i)du, 


DKL{p{^\yt,Bt,Ft-i)\\p{u\Ft-i)) = y log 

P{yt\BuFt-i) J p{yt\Bt,Ft-i) 

where the second line follows from equation Q. Therefore, by swapping the integral and the summation, 

OO 

EIGtiBt) = Y.DKL{p{u^\yt,Bt,Ft-i)\\p{u;\Ft-i))p{yt\Bt,Ft-i) 
yt=o 

'p{yt\Bt,u:,Ft-i)' 


/ 

j/t=o 


p {yt\Bt, u, Ft-i)p {u\Ft-i) duj 


p{yt\BuFt-i) 

= {Bkl {p{yt\Bt,u>,Ft-i) \ \p {yt\Bt, Ft-i))) • 


A.3 Equation Q 

Recall that 


p{yt\Bt,Ft-i,u) = 


p{yt,Bt\Ft-i,u) ^ Jp {yt, Bt\Ft-i,p,u:) p {y\Ft-i,u)) dy 

p{Bt\Ft-i,u>) p{Bt\Ft-i,u) 

f (P iyt\Bt, Ft-i,y, uj) p {Bt\Ft-i,y, on) p {y\Ft-i,uj)) dy 


p{Bt\Ft-i,on) 

= j p{yt\Bt,y)p{y\Ft-i,u)dy, 


and 


p{y\Ft-i,uj) 


p {yt-i, Bt-i\y,Ft-2,i^)P {y\Bt-2,^^) 

p{yt-i,Bt-i\Ft-2,on) 

p {yt-i\Bt-i,y, Ft-2, 0 ^)P {Bt-i\Ft-2, P {y\Ft-2,uj) 
p {yt-i\Bt-i, Ft-2, on) p {Bt-i\Ft-2, on) 
p{yt-i\Bt-i,y)p{y\Ft-2,i^) _ _ Ui=iP{ys\Bs,y) 

p{yt-i\Bt-i,Ft-2,^) Y\l'l\piys\Bs,Fs-i,on)^ ^ 


Combining the two formulas above will give us equation Q. 
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A.4 Equation Q 


We have 


p / U Us \ 

PLN(y|/i,l:) = / nn Poisson (^log a| /i, Id (^log A^ 


vs=l n=l 


det (27rS 


- 1/2 


‘+S^Xg-(x-A) s l{x-/i)/2j^. 


m=lT\n=iysn\ J 

det (27rl:) “'^%(^+^®)^^’HA+ss)/2-/i^f:-i/i/2 


n:=in::Liy.n! 

-n^e^ -(x-/t-SS)^S-l(x-/i-SS)/2 


yje--e 

gS^SS/2+/t^S 
m TT^^S 


e- ^logAAj A 


dx 


/i + 5]S,5] dA, 


n:=in::Liysn! 

where ’s are the unique components of A with corresponding multivariate log-normal parameters being 
p, and I] and 

ysn\^s Poisson ^As^ i.i.d. for all n = 1,..., n^, s = l,...,rt; 


y = [y^rx]”Lir=i; s = 


Vun 


Kn=l 


n=l 


n = x= (^logAi,...,logA„^ 


B PoLNA Parameters Calculation 


In this Section, we will introduce the detailed procedures to calculate /x^ and Sa- After a fine discretization 
on the frequency space (z/), we will have 


V 

A{B,r,) 


Md (At^,K) , [t]]- = rj(u 


Xj) 


i=l 



[K],,, = 


V J 


L V ’ J\ 




Aj)&B 


j,j' = 


D is the number of discretization. 

The distribution of A {B, rj) is approximated by a log-normal distribution. We have 
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Also, r] + log follows a multivariate normal distribution. Our problem now is equal 

to the following one: 

Here, n = Let s* = log Approximately, 

For alH, 1 < < n*, define = Yl!j=i vl’ ®ni = •s*- Then 


si = log = sl_i + log ^1 + 


where = yl - si_^. 

For any two random variables X and Y, let yx — (^)’ ^‘x ~ (^)’ PxY — Cor (X, Y). 

Following |Safak| P'9931 , we have 


psl 

— jj i + Gi 

^k-i 


Fio® 

= Pyi - Psi_^ 


K 

— (J i — (7 i 

yl ^k-i 


o'^i 

^2 n'i 

- ^4 , “ Cl 

fc —1 



^4 

1 / 


Psi,yi 


I (Jr,,i (To 


[Psl_„yl^yl - ^si_^ 


Cs ( > Pw, 




J-1 


1 - 


Wj ! /^UI* ^ 


^^5 


a 




C*3 ) Pw'j ^ 


cr 


where Gl, G2, and G3 will be defined lafer. 

We have fwo mefhods fo compufe p i j . For any i, j and 1 < /ci < n* — 1, 1 < ^2 < n, — 1, suppose 
we already know p i j . Then we could firsl updafe p i j and fhen p i j . The sfrafegy is as 

^fcl’^fe2 ^fel+l’®fc2 ^ki+l’^k2+2 

follows: 


where 


( 42 - 


^2 




Pgi 


(4i - Fs* + log (1 + e '“I+M ) 42 - Psi 


°k2 


O'gi O' j 

fcl+1 % 


O ci 


PP J 


I 

+ Pk +1,4 

ki+l' K2 

“fei+i 




^si ,,^wi 

+1 fc ]^+1 


We could have similar formulas for p i j . 

^k-^+l’^k2+2 


19 















For a special case, if we have n* = nj and already know p^i , 1 < k < rii — 1 = rij — 1, we will have 


E 


4 - k-si + log (l + 


4 “ ksi + log (i + 


-Gl (T^ 


P. 


.i c;j 
7c+l’*fc+l 




k + ly 


-G 


1 i M 

^ \ wi , ^ ^ w 


'fc+i ^fc+i/ 


CT^i O' 


'k + 1 


(J ai O' j 
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^ j G 3 




^ /c j 

Pc* J r P i J 

^k’^ka^i a j “'fe+i-^fe a^i aj a,„i 

*fc+i «fc+i *’ ■ ■ 


+ Pv 


^siG3(<y^j ik^i ) 

k \ ^fc+1 “'fc + l/ 


9» 

,i G j U ,,,i fc + 1 ’ k 

'fc+l Sj; + 1 ^k + 1 


O' j (Jgi (7 j 
■^fc+1 “fc+l ’"fc+l 


E (log (1 + ++.) log (1 + - Gi Gi j 




'fc+i “fc+i 


where E ^log ^1 + log + e^'=+i^^ could be computed with Gauss-Hermite quadrature. 

Wp ha VP 


We have 

<T 


$(x) = 

Ck = 


(_i)fc+i 
k ’ 


^ 2 (-i4 +' " 1 
^ ~ k + 1 




o 00 

A / 7Tl\ (7 ^ 

Gi(cj,m) = md) — H— -=e 2^ + > Ck[F {a,m,k) + F {a,-m,h)] , 


2(t‘^ 


02 ( 0 -, m) = (m^ + cT^) <h f—') + (m + log4)—^e i 

\ (7 / V 27r 

00 00 

+2 Ck {m — ka^) F (cr, m, k) + [-?" {o', m,k) + F {a, —m, k)] , 

1,-1 k=2 


k=l 
00 

G3{a,m) = o-^y~](-l)^ [F(o-,m, fc) 


Example 1: Exponential Trigonometric Templates, Continued Using the setting of Example 1, here we 
demonstrate an illustrative simulation to support the fundamental of our Poisson log-normal approximation: 
a sum of log-normal random vectors can be approximated by a log-normal random vector again. 

When f = 1, a histogram of log A {Bi,p) with Bi = [0, 0.1] is shown in Figure]^ where 10,000 Monte 
Carlo GP paths p are drawn and we calculate 

A{Bi,p)= [ e^l^ldz2. 

Jbi 
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Figure 5: The histogram of logA{Bi,r]) for Bi = [0,0.1] as in Example 1 with 10,000 Monte Carlo 
samples. Also shown is the fitting of a normal distrihution. 


One ean see that the normal distrihution does quite a good joh to approximate the distrihution of log A {Bi , rj), 
whieh means that A {Bi,r]) ean he approximated hy a log-normal distribution. This is aetually a general 
phenomenon, even valid for the multivariate ease. 

C Sequential Monte Carlo Algorithm 
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Algorithm 1 SMCS for BNS-ED 

Require: 

N: the number of particles; 

M[/: the upper bound of observations; 

Ml: the upper bound of observations; 

c: the threshold of effective sample size (ESS); 

e: the threshold of information gain; 

Filters: the set of available filters and their band widths; 
a: the prior parameter of the model weights; 
r: the step size of the Markovian move. 


1 : Draw Dir (ct), i = 1,N. 

2: V’? ^ = 

3: for t = 1,2, 3 , ... do 

4: for i = 1 ,..., N, ut = Ml, ■■■, Mu, ut G Filters do 



5 


6 : end for 

7: Calculate EIGt {i^t) with eq. Q, vt G Filters. 

8 : Vt ^ argmaxF/Gi (v). 


j/gFilters 


9: Ut ^ Observation at time t. 



14: 


for i = 1 ,..., N do 


16 

17 


15: 
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18 

19 


end for 

// Markovian sampling. 

for i = 1,..., N do 



21 : 


23 

24 

25 

26 


22 : 


^ with probability A, 
1 *^ otherwise. 


end for 
end if 




28: if/Gt<ethen 

29: Break. 

30: end if 

31: end for 
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